class: titleSlide, hide_logo

# Data Visualization

## The grammar of graphics and ggplot2 API

<br>
<center><img src="data:image/png;base64,#logo.png" width="200px"/></center>

---
class: left

# Today's Agenda

.pull-left[

**A. Warm-up**

1. Introductions
2. Setup

**B. The grammar of graphics and ggplot2 API**

1. Tidy data
2. The grammar of graphics
3. The `{ggplot2}` API
4. Questions from the readings

]

.pull-right[

**C. #TidyTuesday**

1. Dataset review
2. Expectations/process

**D. How to get help (time permitting)**

]

---
class: bigFocus, hide-count

# Introductions

In ~30 seconds, tell us about your experience with R and interest in data science. If you use R (or have used it in the past), what is your biggest data processing, visualization, or data communication challenge?
---
class: left, hide_logo, hide-count

## Setup

.pull-left[

**Option 1**

* Create a folder called `glhlth562` on Box or Dropbox (not for sharing, just for backup)
* Create a subfolder called `glhlth562/materials`
* Create another subfolder called `glhlth562/prep`
* Move your class prep into `glhlth562/prep`
* Download and unzip today's materials into `glhlth562/materials`

]

.pull-right[

**Option 2**

* Clone the repository from https://github.com/ericpgreen/glhlth562
* Create a subfolder called `glhlth562/prep`
* Move your class prep into `glhlth562/prep`

]

---
class: newTopic, hide_logo

# The grammar of graphics and ggplot2 API

---
class: newTopicSub, hide_logo

# Data science life cycle

---

<img src="data:image/png;base64,#img/data-science-cycle/data-science-cycle.004.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---

<img src="data:image/png;base64,#img/data-science-cycle/data-science-cycle.007.png" width="90%" style="display: block; margin: auto auto auto 0;" />

---
class: left, hide_logo, hide-count

### No one does this better than the FT's COVID team

<iframe src="https://www.ft.com/content/a2901ce8-5eb7-4633-b89c-cbdf5b386938" width="100%" height="90%" data-external="1"></iframe>

---
class: left, hide_logo, hide-count

### John Burn-Murdoch on why he uses {ggplot2}

<iframe src="https://johnburnmurdoch.github.io/slides/r-ggplot/#/" width="100%" height="90%" data-external="1"></iframe>

---
class: newTopicSub, hide_logo

# Tidy data and data wrangling

<img src="data:image/png;base64,#img/data-science-cycle/data-science-cycle.003.png" width="60%" style="display: block; margin: auto;" />

---
class: left, hide-count

### What is your guess at how the data are structured?

<img src="data:image/png;base64,#img/ft2.png" width="80%" style="display: block; margin: auto;" />
---
class: left

### Wide format by state/county

<img src="data:image/png;base64,#img/data.png" width="100%" />

---
class: left

### Tidy data

According to Hadley Wickham, RStudio's Chief Scientist and creator of many great packages like `{ggplot2}`, [tidy data](https://www.jstatsoft.org/article/view/v059i10) have three characteristics:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

---
class: left

.pull-left[

### Long format

<img src="data:image/png;base64,#img/long.png" width="100%" style="display: block; margin: auto;" />

]

.pull-right[

### State-level data separate

<img src="data:image/png;base64,#img/pop.png" width="100%" style="display: block; margin: auto;" />

]

---
class: left

### There is (usually) no free lunch

.pull-left[

Data are rarely ready to plot right out of the box. We'll skip the import and wrangling details for now, but take note (right) of what's required in this case.

]

.verysmall[
.pull-right[

```
# packages the pipeline relies on (zoo is called via zoo::)
library(tidyverse)
library(lubridate)  # mdy()

pop <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv") %>%
  # keep only the columns we need
  select(Province_State, Population) %>%
  # keep only the rows we need
  filter(Province_State=="North Carolina" | Province_State=="Arizona") %>%
  # aggregate population
  group_by(Province_State) %>%
  summarize(Population = sum(Population))

covid <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv") %>%
  # keep only the columns we need
  select(Province_State, Population, contains("/")) %>%
  # keep only the rows we need
  filter(Province_State=="North Carolina" | Province_State=="Arizona") %>%
  # make long
  pivot_longer(cols = c(-Province_State, -Population),
               names_to = "date",
               values_to = "totalDeaths") %>%
  # convert date to proper date
  mutate(date = mdy(date)) %>%
  # aggregate by state and date
  group_by(Province_State, date) %>%
  summarize(totalDeaths = sum(totalDeaths)) %>%
  # calculate daily deaths
  # end up with some negative values, likely due to data revisions
  group_by(Province_State) %>%
  mutate(dailyDeaths = c(totalDeaths[1], diff(totalDeaths))) %>%
  # make sure series is complete (fill must be a named list)
  ungroup() %>%
  complete(Province_State, date, fill = list(dailyDeaths = 0)) %>%
  # keep only dates in range
  filter(date >= "2020-08-20" & date <= "2021-07-12") %>%
  # calculate rolling mean
  group_by(Province_State) %>%
  mutate(deaths_roll7 = zoo::rollmean(dailyDeaths, k = 7, fill = NA)) %>%
  filter(!is.na(deaths_roll7)) %>%
  filter(date >= "2020-09-01") %>%
  # merge in pop data and normalize
  left_join(pop, by="Province_State") %>%
  mutate(deaths_roll7_100k = (deaths_roll7/Population)*100000) %>%
  # label the last entry for each state (for the plotting step)
  mutate(label = if_else(date == max(date), Province_State, NA_character_)) %>%
  select(Province_State, date, deaths_roll7_100k, label)
```

]
]

---
class: newTopicSub, hide_logo

# Grammar of graphics

---
class: left, hide-count

### What are the key plot elements?

<img src="data:image/png;base64,#img/ft2.png" width="75%" style="display: block; margin: auto;" />
---
class: left

### What are the key plot elements?

.pull-left[

<img src="data:image/png;base64,#img/ft2.png" width="100%" style="display: block; margin: auto;" />

]

.pull-right[

.can-edit[Your ideas]

]

---
class: left

### What are the key plot elements?

.pull-left[

<img src="data:image/png;base64,#img/ft2.png" width="100%" style="display: block; margin: auto;" />

]

.pull-right[

* Lines by group with group labels
* Axis labels
* Headline title
* Down-to-business subtitle
* Logo
* Caption with data source
* FT theme
* Tooltip

]

---
class: left

# Grammar of graphics

<div style= "float:left;position: relative; top: -20px;left: -40px;"> <img src="img/wilkinson.jpg" style="border: 0;"> </div>

`{ggplot2}` implements Leland Wilkinson's "grammar of graphics". Wickham (2010) describes the grammar of graphics as:

"...a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the "scatterplot") and gain insight into the deep structure that underlies statistical graphics."

---
class: left

<img src="data:image/png;base64,#img/thomas_ggplot.png" width="90%" style="display: block; margin: auto;" />

<span style="font-size: 50%;">Source: Thomas Lin Pedersen. Amazing workshop on `ggplot2`: https://github.com/thomasp85/ggplot2_workshop</span>

---
class: left

<img src="data:image/png;base64,#img/ggplot_gif.gif" width="90%" style="display: block; margin: auto;" />

<span style="font-size: 50%;">Source: Thomas de Beus, https://tinyurl.com/sf8zude</span>

---
class: newTopicSub, hide_logo

# {ggplot2} API

---
class: left

### Let's use this example to explore the API

<img src="data:image/png;base64,#img/ft2.png" width="70%" style="display: block; margin: auto;" />

---
class: left

### Open RStudio

.pull-left[

* Find and open the `dataviz1_template.Rmd` file in `materials/dataviz1`.
* Install packages in `setup` chunk
* Scroll to `canvas` chunk
* Press the arrow pointing down

(We'll dig into the details of R Markdown files like this later.)

]

.pull-right[

<img src="data:image/png;base64,#img/open.png" width="100%" style="display: block; margin: auto;" />

]

---
class: left

### `glimpse()` the processed data

The data are stored in an object called `covid`. In addition to running `glimpse(covid)`, you can also click on `covid` in the Environment to see a 'spreadsheet' view.

```
## Rows: 624
## Columns: 6
## $ X1                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Province_State    <chr> "Arizona", "Arizona", "Arizona", "Arizona", "Arizona…
## $ date              <date> 2020-09-01, 2020-09-02, 2020-09-03, 2020-09-04, 202…
## $ deaths_roll7      <dbl> 27.57143, 28.57143, 27.28571, 27.14286, 25.28571, 26…
## $ deaths_roll7_100k <dbl> 0.3787952, 0.3925339, 0.3748698, 0.3729072, 0.347392…
## $ label             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
```

---
class: left

### The `ggplot()` function

.panelset[
.panel[.panel-name[Code]

`ggplot()` tells R that we want to plot something, but we haven't specified any data or told R how the data map to this plot canvas.

```r
# see chunk 'canvas'
ggplot()
```

]

.panel[.panel-name[Plot]

<img src="data:image/png;base64,#dataviz1_deck_files/figure-html/canvas_plot-1.png" width="100%" />

]
]

---
class: left

### Map data to the canvas

.panelset[
.panel[.panel-name[Code]

The first step is to tell `ggplot()` what data you want to plot and specify how the data map onto the canvas. Which variable maps to the `x` axis? To the `y` axis?
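
If the blanks feel abstract, here is a minimal sketch of the same idea on a made-up data frame. The object `toy` and its columns `day` and `value` are hypothetical and are not part of the course materials; the only point is that `aes()` pairs column names with plot axes.

```r
library(ggplot2)

# Hypothetical toy data (not the course data): one row per observation
toy <- data.frame(
  day   = 1:5,
  value = c(2, 4, 3, 6, 5)
)

# Map the `day` column to the x axis and the `value` column to the y axis.
# No geom has been added yet, so the result is a set of axes with no data drawn.
p_map <- ggplot(toy, aes(x = day, y = value))
p_map
```

The exercise follows the same pattern, using the columns of `covid` instead.
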

```r
# see chunk 'mapping'
ggplot(covid, aes(x = ____, y = ____))
```

]

.panel[.panel-name[Answer]

```r
ggplot(covid, aes(x = date, y = deaths_roll7_100k))
```

]

.panel[.panel-name[Plot]

<img src="data:image/png;base64,#dataviz1_deck_files/figure-html/mapping_plot-1.png" width="100%" />

]
]

---
class: left, hide_logo

### Add a `geom_`

<iframe src="https://ggplot2.tidyverse.org/reference/#section-layers" width="100%" height="90%" data-external="1"></iframe>

---
class: left

### Add a `geom_`

.panelset[
.panel[.panel-name[Code]

Our plot has appropriate axes now that we've mapped the data, but the canvas is blank because we've not told R how to represent the data. Our inspiration plot uses lines, so we will too.

```r
# see chunk 'geom'
ggplot(covid, aes(x = date, y = deaths_roll7_100k)) +
  geom_line()
```

]

.panel[.panel-name[Plot]

<img src="data:image/png;base64,#dataviz1_deck_files/figure-html/geom_plot-1.png" width="100%" />

]
]

---
class: left

### Specify the grouping

.panelset[
.panel[.panel-name[Code]

This does not look right because we did not tell R that the data are grouped. `covid` is a 'long' dataset with two series: Arizona and North Carolina. Which variable contains this grouping information?

```r
# see chunk 'grouping'
ggplot(covid, aes(x = date, y = deaths_roll7_100k, group = ____)) +
  geom_line()
```

]

.panel[.panel-name[Answer]

```r
ggplot(covid, aes(x = date, y = deaths_roll7_100k, group = Province_State)) +
  geom_line()
```

]

.panel[.panel-name[Plot]

<img src="data:image/png;base64,#dataviz1_deck_files/figure-html/grouping_plot-1.png" width="100%" />

]
]

---
class: left

### Specify the grouping color

.panelset[
.panel[.panel-name[Code]

We need to add color to identify each state's line. One point of friction for new users is deciding whether to use `color` or `fill`. Remember this: Lines and points use `color`.
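
As a rough illustration of that rule, here is a sketch on made-up data (`toy`, `state`, `day`, and `value` are hypothetical names, not course objects). `color` controls the stroke of lines and points, while `fill` controls the interior of shapes that have area, such as bars.

```r
library(ggplot2)

# Hypothetical toy data: two groups observed on three days
toy <- data.frame(
  state = rep(c("A", "B"), each = 3),
  day   = rep(1:3, times = 2),
  value = c(1, 2, 3, 2, 2, 4)
)

# Lines have no interior, so the group is mapped to `color`
p_line <- ggplot(toy, aes(x = day, y = value, color = state)) +
  geom_line()

# Bars have an interior, so the group is mapped to `fill`
p_bar <- ggplot(toy, aes(x = day, y = value, fill = state)) +
  geom_col(position = "dodge")
```
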

```r
# see chunk 'addColor'
ggplot(covid, aes(x = date, y = deaths_roll7_100k, color = Province_State)) +
  geom_line()
```

Rather than specifying `group` explicitly, we can map `color` instead: mapping a discrete variable to `color` also groups the data by that variable.

]

.panel[.panel-name[Plot]

<img src="data:image/png;base64,#dataviz1_deck_files/figure-html/addColor_plot-1.png" width="100%" />

]
]

---
class: left

### Modify the style

We'll dig into plot refinements later. For now, just take note of the layering approach.

.panelset[
.panel[.panel-name[Code]

```r
# see chunk 'changeColors'
ggplot(covid,                                # data
       aes(x = date, y = deaths_roll7_100k,  # mapping
           color = Province_State)) +
  geom_line() +                              # geom
  scale_color_manual(values = c(             # scale
    "#eb5e8d",  # pink
    "#0f5599"   # blue
  )) +
  ft_theme()                                 # theme
```

]

.panel[.panel-name[Plot]

<img src="data:image/png;base64,#dataviz1_deck_files/figure-html/changeColors_plot-1.png" width="70%" />

]
]

---
class: left

<img src="data:image/png;base64,#img/thomas_ggplot.png" width="90%" style="display: block; margin: auto;" />

<span style="font-size: 50%;">Source: Thomas Lin Pedersen. Amazing workshop on `ggplot2`: https://github.com/thomasp85/ggplot2_workshop</span>

---
class: left

# Credits

Deck by Eric Green ([@ericpgreen](https://twitter.com/ericpgreen)), licensed under Creative Commons Attribution-ShareAlike [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

* {[`xaringan`](https://github.com/yihui/xaringan)} for slides with help from {[`xaringanExtra`](https://github.com/gadenbuie/xaringanExtra)}
* Data science lifecycle, [Data Science in a Box](https://datasciencebox.org/)
* [JHU CSSE COVID-19 data](https://github.com/CSSEGISandData/COVID-19)